Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Authors

  • Mike Kestemont
  • Jeroen De Gussem
Abstract

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. Both are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS-tagger is then used to select the most appropriate lemma-tag pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.

Introduction and challenges

The Latin language, and its historic variants in particular, has long been a topic of major interest in Natural Language Processing [Piotrowski 2012]. Especially in the Digital Humanities community, the automated processing of Latin texts has always been a popular research topic. In a variety of computational applications, such as text re-use detection [Franzini et al, 2015], it is desirable to annotate and augment Latin texts with useful morpho-syntactical or lexical information, such as lemmas. In this paper, we will focus on two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization.

Given a piece of Latin text, the task of lemmatization involves assigning each word to a single dictionary headword or ‘lemma’: a base-form label (preferably in a normalized orthography) grouping all word tokens which differ only in spelling and/or inflection [Knowles et al, 2004].
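The cascaded, lexicon-dependent setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon, the tag labels, and the one-line context rule (which merely stands in for a real context-aware PoS-tagger) are hypothetical and are not taken from any existing tool or from the authors' system.

```python
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[str, str]  # (lemma, PoS tag)

# Hypothetical toy lexicon: token -> all candidate (lemma, tag) pairs.
LEXICON: Dict[str, List[Candidate]] = {
    "legi": [("lego", "VERB"), ("lex", "NOUN")],
    "librum": [("liber", "NOUN")],
    "parent": [("pareo", "VERB")],
}


def generate_candidates(token: str) -> List[Candidate]:
    """Stage 1: look up every lemma-tag pair the lexicon knows for a token."""
    return LEXICON.get(token.lower(), [])


def select_candidate(
    candidates: List[Candidate], prev_tag: Optional[str]
) -> Optional[Candidate]:
    """Stage 2: pick one candidate using context.

    Toy rule (an assumption, not a real tagger): prefer a verb reading
    directly after a noun, and a noun reading otherwise.
    """
    if not candidates:
        return None  # out-of-lexicon item: stage 2 has nothing to choose from
    preferred = "VERB" if prev_tag == "NOUN" else "NOUN"
    for candidate in candidates:
        if candidate[1] == preferred:
            return candidate
    return candidates[0]


def tag_sentence(
    tokens: List[str],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Run both stages left to right, feeding each decision into the next."""
    output = []
    prev_tag: Optional[str] = None
    for token in tokens:
        choice = select_candidate(generate_candidates(token), prev_tag)
        lemma, tag = choice if choice else (None, None)
        output.append((token, lemma, tag))
        prev_tag = tag  # an error or gap here percolates to the next token
    return output


if __name__ == "__main__":
    print(tag_sentence(["librum", "legi"]))
    print(tag_sentence(["legi", "parent"]))
    print(tag_sentence(["michi"]))  # medieval spelling absent from the lexicon
```

In this sketch the two weaknesses named above are visible directly: a spelling variant missing from the lexicon yields no analysis at all, and every decision made in the second stage is conditioned on the output of earlier, possibly wrong, decisions.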
The task of lemmatization is closely related to that of part-of-speech (PoS) tagging [Jurafsky et al, 2000], in which each word in a running text should be assigned a tag indicating its part of speech or word class (e.g. noun, verb, ...). The difficulty of PoS-tagging of course depends strongly on the complexity and granularity of the chosen tagset. Lemmatization and PoS-tagging are classic forms of sequence labeling, in which tags are assigned to words on the basis of both their individual appearance and the words that surround them.

While both lemmatization and PoS-tagging are rather basic preprocessing steps, they are generally complicated by a number of interesting challenges which the Latin language poses. First of all, while plain stemming might take us a long way [Schinke et al, 1996], many Latin suffixes cannot be automatically linked to an unambiguous morphological category. Words ending in -ter, for example, correspond to no fewer than six different parts of speech: nouns (fra-ter), adjectives (dex-ter), pronouns (al-ter), adverbs (gravi-ter), numeral adverbs (qua-ter) and prepositions (in-ter) [Manuel de lemmatisation, LASLA, 2013]. Additionally, like many other languages, Latin is teeming with homographs which require context to be disambiguated. A token such as legi can be lemmatized both under the verb lego and under the noun lex. Similarly ambiguous tokens include common forms such as quae, satis or venis. For lemmatization specifically, another problem is posed by verb forms which show no resemblance to their lemma at all. The fact that tuli is an ‘active 1st person singular perfect’ of fero is not at all obvious, and the same goes for fero’s perfect participle latus, which could in turn be confused with the homonymous common noun latus (“side”). A tagger has to learn the morphological connection between tuli, latus and fero by moving beyond outward appearances (prefixes, word stems or suffixes), and by properly modelling the immediate context surrounding these words.
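The three difficulties just listed can be made concrete with a small sketch that uses only the word forms quoted above. The function names and the part-of-speech labels are hypothetical choices made for illustration. The code only measures surface ambiguity and is not a tagger or a lemmatizer.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Word forms and parts of speech as listed in the text for the suffix -ter.
TER_WORDS: List[Tuple[str, str]] = [
    ("frater", "noun"),
    ("dexter", "adjective"),
    ("alter", "pronoun"),
    ("graviter", "adverb"),
    ("quater", "numeral adverb"),
    ("inter", "preposition"),
]

# (form, lemma) pairs mentioned in the text.
FORM_LEMMA: List[Tuple[str, str]] = [
    ("legi", "lego"),
    ("legi", "lex"),
    ("tuli", "fero"),
    ("latus", "fero"),
    ("latus", "latus"),
]


def tags_by_suffix(words: List[Tuple[str, str]], n: int = 3) -> Dict[str, Set[str]]:
    """Collect every tag observed for each word-final character n-gram."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for form, tag in words:
        table[form[-n:]].add(tag)
    return dict(table)


def lemmas_by_form(pairs: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect every lemma observed for each surface form (homographs)."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for form, lemma in pairs:
        table[form].add(lemma)
    return dict(table)


def shared_prefix_len(form: str, lemma: str) -> int:
    """Number of leading characters a form shares with its lemma."""
    count = 0
    for a, b in zip(form, lemma):
        if a != b:
            break
        count += 1
    return count


if __name__ == "__main__":
    print(tags_by_suffix(TER_WORDS))   # one suffix, six parts of speech
    print(lemmas_by_form(FORM_LEMMA))  # legi and latus each have two lemmas
    for form, lemma in FORM_LEMMA:
        print(form, lemma, shared_prefix_len(form, lemma))
```

A shared prefix length of zero for tuli and fero shows why a purely surface-based stemmer cannot recover the lemma, and the two-lemma entries for legi and latus show why context is needed even when the form itself is known.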




Journal:
  • CoRR

Volume abs/1603.01597

Publication date: 2016